NSF PAR Search | NSF Public Access Repository

Navigating the United States Legislative Landscape on Voice Privacy: Existing Laws, Proposed Bills, Protection for Children, and Synthetic Data for AI

https://doi.org/10.21437/SynData4GenAI.2024-19

Dutta, Satwik; Hansen, John H (August 2024, ISCA)

Privacy is a hot topic for policymakers across the globe, including the United States. Evolving advances in AI and emerging concerns about the misuse of personal data have pushed policymakers to draft legislation on trustworthy AI and privacy protection for citizens. This paper presents the state of privacy legislation in the U.S. Congress and outlines how voice data is treated in the definitions used in that legislation. It also reviews additional privacy protections for children, and presents a holistic review of enacted and proposed privacy laws across the fifty U.S. states, including their consideration of voice data and their guidelines for processing children’s data. As a groundbreaking alternative to actual human data, ethically generated synthetic data offers considerable flexibility for keeping AI innovation in progress. Because policymakers' consideration of synthetic data in AI legislation is relatively new compared with that of privacy laws, this paper also reviews regulatory considerations for synthetic data.
Full Text Available
Speaker Tracking using Graph Attention Networks with Varying Duration Utterances across Multi-Channel Naturalistic Data: Fearless Steps Apollo-11 Audio Corpus

https://doi.org/10.21437/Interspeech.2023-1258

Shekar, Meena M.; Hansen, John H. (August 2023, ISCA INTERSPEECH-2023)

Speaker tracking in spontaneous naturalistic data continues to be a major research challenge, especially for short turn-taking communications. The NASA Apollo-11 space mission brought astronauts to the moon and back, and its team-based voice communications were captured. Building robust speaker classification models for this corpus poses significant challenges due to variability of speaker turns, imbalanced speaker classes, and time-varying background noise/distortions. This study proposes a novel approach for speaker classification and tracking, utilizing a graph attention network framework that builds upon pretrained speaker embeddings. The model’s robustness is evaluated for varying numbers of speakers (10-140), achieving classification accuracy of 90.78% for 10 speakers and 79.86% for 140 speakers. Furthermore, a secondary investigation focuses on tracking speakers-of-interest (SoI) during mission-critical phases, which essentially serves as a lasting tribute to the 'Heroes Behind the Heroes'.
Full Text Available
What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model

https://doi.org/10.21437/Interspeech.2023-2254

Yang, Mu; Shekar, Ram C.; Kang, Okim; Hansen, John H. (August 2023, ISCA INTERSPEECH-2023)
This study is focused on understanding and quantifying the change in phoneme and prosody information encoded in the Self-Supervised Learning (SSL) model, brought by an accent identification (AID) fine-tuning task. This problem is addressed based on model probing. Specifically, we conduct a systematic layer-wise analysis of the representations of the Transformer layers on a phoneme correlation task, and a novel word-level prosody prediction task. We compare the probing performance of the pre-trained and fine-tuned SSL models. Results show that the AID fine-tuning task steers the top 2 layers to learn richer phoneme and prosody representation. These changes share some similarities with the effects of fine-tuning with an Automatic Speech Recognition task. In addition, we observe strong accent-specific phoneme representations in layer 9. To sum up, this study provides insights into the understanding of SSL features and their interactions with fine-tuning tasks.
Full Text Available
Assessment of Non-Native Speech Intelligibility using Wav2vec2-based Mispronunciation Detection and Multi-level Goodness of Pronunciation Transformer

https://doi.org/10.21437/Interspeech.2023-2371

Shekar, Ram C.; Yang, Mu; Hirschi, Kevin; Looney, Stephen; Kang, Okim; Hansen, John H. (August 2023, ISCA INTERSPEECH-2023)
Automatic pronunciation assessment (APA) plays an important role in providing feedback for self-directed language learners in computer-assisted pronunciation training (CAPT). Several mispronunciation detection and diagnosis (MDD) systems have achieved promising performance based on end-to-end phoneme recognition. However, assessing the intelligibility of second language (L2) speech remains a challenging problem. One issue is the lack of large-scale labeled speech data from non-native speakers. Additionally, relying only on one aspect (e.g., accuracy) at a phonetic level may not provide a sufficient assessment of pronunciation quality and L2 intelligibility. It is possible to leverage segmental/phonetic-level features such as goodness of pronunciation (GOP); however, feature granularity may cause a discrepancy in prosodic-level (suprasegmental) pronunciation assessment. In this study, a Wav2vec 2.0-based MDD system and a GOP feature-based Transformer are employed to characterize L2 intelligibility. Here, an L2 speech dataset with human-annotated prosodic (suprasegmental) labels is used for multi-granular and multi-aspect pronunciation assessment and for identifying factors important for intelligibility in L2 English speech. The study provides a transformative comparative assessment of automated pronunciation scores versus the relationship between suprasegmental features and listener perceptions, which taken collectively can help support the development of instantaneous assessment tools and solutions for L2 learners.
Full Text Available
DeepComboSAD: Spectro-Temporal Correlation Based Speech Activity Detection for Naturalistic Audio Streams

https://doi.org/10.1109/LSP.2023.3319229

Joglekar, Aditya; Hansen, John H. (January 2023, IEEE Signal Processing Letters)

Speech activity detection (SAD) serves as a crucial front-end system to several downstream Speech and Language Technology (SLT) tasks such as speaker diarization, speaker identification, and speech recognition. Recent years have seen deep learning (DL)-based SAD systems designed to improve robustness against static background noise and interfering speakers. However, SAD performance can be severely limited for conversations recorded in naturalistic environments due to dynamic acoustic scenarios and previously unseen non-speech artifacts. In this letter, we propose an end-to-end deep learning framework designed to be robust to time-varying noise profiles observed in naturalistic audio. We develop a novel SAD solution for the UTDallas Fearless Steps Apollo corpus based on NASA’s Apollo missions. The proposed system leverages spectro-temporal correlations with a threshold optimization mechanism to adjust to acoustic variabilities across multiple channels and missions. This system is trained and evaluated on the Fearless Steps Challenge (FSC) corpus (a subset of the Apollo corpus). Experimental results indicate a high degree of adaptability to out-of-domain data, achieving a relative Detection Cost Function (DCF) performance improvement of over 50% compared to the previous FSC baselines and state-of-the-art (SOTA) SAD systems. The proposed model also outperforms the most recent DL-based SOTA systems from FSC Phase-4. Ablation analysis is conducted to confirm the efficacy of the proposed spectro-temporal features.
Full Text Available
Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting

https://doi.org/10.1109/ICASSP49357.2023.10095436

López-Espejo, Iván; Shekar, Ram C.; Tan, Zheng-Hua; Jensen, Jesper; Hansen, John H. (June 2023, IEEE ICASSP-2023: Inter. Conf. Audio, Speech, and Signal Processing)

In the context of keyword spotting (KWS), the replacement of handcrafted speech features by learnable features has not yielded superior KWS performance. In this study, we demonstrate that filterbank learning outperforms handcrafted speech features for KWS whenever the number of filterbank channels is severely decreased. Reducing the number of channels might yield a certain drop in KWS performance, but also a substantial reduction in energy consumption, which is key when deploying common always-on KWS on low-resource devices. Experimental results on a noisy version of the Google Speech Commands Dataset show that filterbank learning adapts to noise characteristics to provide a higher degree of robustness to noise, especially when dropout is integrated. Thus, switching from typically used 40-channel log-Mel features to 8-channel learned features leads to a relative KWS accuracy loss of only 3.5% while simultaneously achieving a 6.3× energy consumption reduction.
Full Text Available
Characterization and normalization of second language speech intelligibility through lexical stress, speech rate, rhythm, and pauses

https://doi.org/10.1121/10.0016224

Kang, Okim; Hirschi, Kevin; Hansen, John H.; Looney, Stephen (October 2022, The Journal of the Acoustical Society of America)

While a range of measures based on speech production, language, and perception are possible (Manun et al., 2020) for the prediction and estimation of speech intelligibility, what constitutes second language (L2) intelligibility remains under-defined. Prosodic and temporal features (i.e., stress, speech rate, rhythm, and pause placement) have been shown to impact listener perception (Kang et al., 2020). Still, their relationship with highly intelligible speech is yet unclear. This study aimed to characterize L2 speech intelligibility. Acoustic analyses, including PRAAT and Python scripts, were conducted on 405 speech samples (30 s) from 102 L2 English speakers with a wide variety of backgrounds, proficiency levels, and intelligibility levels. The results indicate that highly intelligible speakers of English employ between 2 and 4 syllables per second and that higher or lower speeds are less intelligible. Silent pauses between 0.3 and 0.8 s were associated with the highest levels of intelligibility. Rhythm, measured by Δ syllable length of all content syllables, was marginally associated with intelligibility. Finally, lexical stress accuracy did not interfere substantially with intelligibility until less than 70% of the polysyllabic words were incorrect. These findings inform the fields of first and second language research as well as language education and pathology.
Full Text Available
Speaker tracking across a massive naturalistic audio corpus: Apollo-11

https://doi.org/10.1121/10.0008574

Chandra Shekar, Meena; Hansen, John H. (October 2021, The Journal of the Acoustical Society of America)

Apollo-11 was the first manned space mission to successfully bring astronauts to the moon. More than 400 mission specialists/support team members were involved, whose voice communications were captured using the SoundScriber multi-channel analog system. To ensure mission success, it was necessary for teams to engage, communicate, learn, address and solve problems in a timely manner. Hence, in order to identify each speaker’s role during Apollo missions and analyze group communication, we need to automatically tag and track speakers individually, since manual annotation is costly and time consuming on a massive audio corpus. In this study, we focus on a subset of 100 h derived from the 10 000 h of the Fearless Steps Apollo-11 audio data. We use the concept of “Where’s Waldo” to identify all instances of our speakers-of-interest: (i) Three Astronauts; (ii) Flight Director; and (iii) Capsule Communicator. Analyzing the handful of speakers present in the small audio dataset of 100 h can be extended to the complete Apollo mission. This analysis provides an opportunity to recognize team communications, group dynamics, and human engagement/psychology. Identifying these personnel can help pay tribute to the hundreds of notable engineers and scientists who made this scientific accomplishment possible. Sponsored by NSF #2016725.
Full Text Available
Block-Based High Performance CNN Architectures for Frame-Level Overlapping Speech Detection

https://doi.org/10.1109/TASLP.2020.3036237

Yousefi, Midia; Hansen, John H. (January 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing)
Full Text Available
Assessing Child Communication Engagement via Speech Recognition in Naturalistic Active Learning Spaces

https://doi.org/10.21437/Odyssey.2020-56

Lileikyte, Rasa; Irvin, Dwight; Hansen, John H. (November 2020, ISCA ODYSSEY-2020)

The ability to assess children’s conversational interaction is critical in determining language and cognitive proficiency for typically developing and at-risk children. The earlier an at-risk child is identified, the earlier support can be provided to reduce the social impact of the speech disorder. To date, limited research has been performed on speech recognition for young children in classroom settings. This study addresses speech recognition research with naturalistic children’s speech, where age varies from 2.5 to 5 years. Data augmentation is relatively underexplored for child speech. Therefore, we investigate the effectiveness of data augmentation techniques to improve both language and acoustic models. We explore alternate text augmentation approaches using adult data, Web data, and text generated by recurrent neural networks. We also compare several acoustic augmentation techniques: speed perturbation, tempo perturbation, and adult data. Finally, we comment on child word count rates to assess child speech development.
Full Text Available
